Class Workbook

Ames Housing data

We will look at the Ames Housing data. The task is to predict the houses after 2008 based on data up to 2008.

library(AmesHousing)
data(ames_raw,package = "AmesHousing")
ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]

The loss will be the same as before. If your algorithm decides to pay more than the actual price your company buys. If the predicted price is lower, your company will fail to buy.

  • If you bought for more than the actual value, you’ve overpaid.
  • If you bid less and lost, you lost a profit of 10% of the house price.
calc_loss<-function(prediction,actual){
  difpred <- actual-prediction
  RMSE <-sqrt(mean(difpred^2))
  operation_loss<-abs(sum(difpred[difpred<0]))+sum(0.1*actual[difpred>0])
  return(
    list(RMSE,operation_loss
         )
  )
}

Feature engineering

Types of Feature engineering

There are several categories of feature engineering.

  1. Adding information from other sources
  2. Missing Data Handling
  3. Dealing with problematic values (outliers, inliers, etc)
  4. Making variables that make sense for the context
  5. Transformation
  6. Scaling
  7. Discretization

1. Adding information from other sources

When handed a dataset, it’s easy to jump right into the analysis. This is typical behavior, especially for a novice. However, there is often information that could be explored if you know what you are looking for. There are a few categories of such information.

  1. Information that was not given to you but someone has access to.

When you are not the data creator, sometimes you are not given access to certain information. The most common is information that pertains to privacy or protected attributes. This information is often not given to you for reasons external to the project you are working on. However, in certain circumstances, if you know what you are looking for, you might be able to negotiate information that could save you some headaches down the line. Think outside the box and be creative. The important caveat is that obtaining some information could have legal consequences. Web scraping and other means of data collection should be done with care. Some industry such as pharmacies have strict rule that prohibits the use of pharmacy information for their retail purpose.

  1. Information that is public but you need to obtain.

There are information about places and things on the internet that are easy to incorporate. For example, in housing data, geographic information could be tied to census information. Financial information might require adjusting for inflation, which again can be found on the internet. Other survey information might be available if you care to look for them. One thing to be careful is that not all information that you can find will be useful. You need to balance the time needed vs the benefit of the information.

  1. Information that is confusing for machines

Coded variables without keys do not make sense but for a computer they seem like a numeric variable. If not careful, one might include them as numeric. Take MS SubClass, which codes the building class.

table(ames_raw$`MS SubClass`)
## 
##  020  030  040  045  050  060  070  075  080  085  090  120  150  160  180  190 
## 1079  139    6   18  287  575  128   23  118   48  109  192    1  129   17   61

Unfortunately, the help file does not contain detailed information on the codes. But with some research you will be able to find that codes do not have ordering to them. Therefore, you need to think carefully about what matters and then discretize the variable in some ways.

  • 20 1-STORY 1946 & NEWER ALL STYLES
  • 30 1-STORY 1945 & OLDER
  • 40 1-STORY W/FINISHED ATTIC ALL AGES
  • 45 1-1/2 STORY - UNFINISHED ALL AGES
  • 50 1-1/2 STORY FINISHED ALL AGES
  • 60 2-STORY 1946 & NEWER
  • 70 2-STORY 1945 & OLDER
  • 75 2-1/2 STORY ALL AGES
  • 80 SPLIT OR MULTI-LEVEL
  • 85 SPLIT FOYER
  • 90 DUPLEX - ALL STYLES AND AGES
  • 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
  • 150 1-1/2 STORY PUD - ALL AGES
  • 160 2-STORY PUD - 1946 & NEWER
  • 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
  • 190 2 FAMILY CONVERSION - ALL STYLES AND AGES

2. Missing Data Handling

To handle missing data, it’s always essential to consider the context. Data that is missing is not by themselves a problem. The fundamental problem is the bias that these variable might pose down the line if incorporated. Doing a careful imputation takes effort. When time is of a concern, deleting variables with high rate of missingness should be considered.

  1. Missing data that is not really missing Variable such as Garage Yr Blt has 159 observations missing. But if you look carefully, you will realize that the houses that are missing this information are the ones that have no garage. This is not missing data but a coding problem. One must decide what to do with such information based on the context. You should not fill such missingness with some arbitrary number.
table(ames_raw$`Garage Cars`,is.na(ames_raw$`Garage Yr Blt`))
##    
##     FALSE TRUE
##   0     0  157
##   1   777    1
##   2  1603    0
##   3   374    0
##   4    16    0
##   5     1    0
  1. Missing data that is too big Some variables might have too much missing data, and there may be a good reason for that. If there are ways to craft a variable that could serve as a proxy for such information, one should try. But if such effort introduces additional uncertainty, one might remove the variable altogether.
missing_data_proportion<-colMeans(is.na(ames_raw))
plot(missing_data_proportion)

which(missing_data_proportion>0.1) 
## Lot Frontage        Alley Fireplace Qu      Pool QC        Fence Misc Feature 
##            5            8           59           74           75           76
  1. Missing data that could be an additional information

If missingness is intentional, one might add a variable to signify such missingness. You will need to fill the missing value with some value, which depends on the variable.

  1. Missing completely at random (MCAR)

If MCAR, one could remove the rows with missingness without introducing bias. However, this is a strong assumption that is often not met in practice.

  1. Missing at Random (MAR)

For MAR, regression-based imputation often is used. Many packages allow you to do these imputations reasonably easily. However, one thing that you will need to think about is that some imputation method will work better after transformation then before. This will rely on the model being used to impute. See mi, mice, etc for detail.

  1. Missing not at random (MNAR)

MNAR variable is hard to deal with. One needs to weigh the cost and benefit of including such variables. An example of such is a variable like income. If all the low-income people are not responding, one might use a small number as a proxy. But if there are reasons to believe there multiple categories of cause they are missing, and there is no way to tell, then you might be better off removing the variable.

3. Dealing with problematic values (outliers, inliers, etc)

Problematic observations such as outliers are hard to find and often require you to revisit this step a few times. This is important because you must deal with them before applying transformations. For example, outliers would distort statistics such as means which would be problematic if you plan to use z-score transformation. When you have a lot of zeros, this could impact how you want to transform a variable. EDA often finds outliers, but they may not pop up until the modeling phase. Truncating or removing data with outliers should be done with caution since they often introduce an unwanted feature in the data.

Here is an illustration of two types of outliers that are harder and easier to find.

dat<-rmvnorm(100,c(0,0),matrix(c(3,2,2,3),2,2))
dat<-rbind(dat,c(7,7), c(-3,4))
par(mfrow=c(1,3))
plot(dat[,1],dat[,2],col=c(rep(1,100),2,4))
plot(dat[,1],col=c(rep(1,100),2,4));
plot(dat[,2],col=c(rep(1,100),2,4))

Look at the basement and the 2nd floor Square footage, you can see that there are bimodality as well as properties that have outliers. This should make you cautious of performing scaling to these variables.

 plot(ames_raw$`Bsmt Unf SF`,ames_raw$`2nd Flr SF`)

4. Making variables that make sense for the context

Context matters when doing feature engineering. Take, for example, the Ames housing data. Ames is a university town where many people have some ties to the university of Iowa. Therefore, looking at things like distance from the university might make sense to include in the analysis. Another thing to think about is things like the Year built. The impact of the year built is not absolute and shifts over the years. Therefore one might want to make a variable that is the age of the house at sales.

# handling Year features
ames_raw$yrs_since_remod <- ames_raw$`Yr Sold` - ames_raw$`Year Remod/Add`

# Total Living Area
ames_raw$TotalArea <- ames_raw$`Gr Liv Area` + ames_raw$`Total Bsmt SF`

# TotalBath
ames_raw$TotalBath <- ames_raw$`Bsmt Full Bath` + 0.5 * ames_raw$`Bsmt Half Bath` + ames_raw$`Full Bath` + 0.5 * ames_raw$`Half Bath`

5. Transformation

When the predictor is right skewed they tend to distort the linear model by exhibiting leverage points. Taking a log will resolve such a problem.

p=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)
p2 <- ggMarginal(p, margins = 'both', fill="skyblue", size=4,type="histogram")


p4=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()
p3 <- ggMarginal(p4, margins = 'both', fill="skyblue", size=4,type="histogram")

gridExtra::grid.arrange(p2,p3,ncol=2)

6. Scaling, centering and normalizing.

For linear regression models, centering and scaling does not change the model itself, but they change the interpretability of the model coefficients. Converting all the predictors on a similar scale has its advantage because the size of the coefficient will directly indicate the influence of the predictor. For some hierarchical models, scaling will also help with the convergence problem. But scaling is critical for all the distance-based methods you will encounter later in the semester.

7. Discretization

Categorical variables need to be coded appropriately. Dummy coding or one-hot-encoding is one way when the information is nominal. Take, for example, the building type variable by default, it’s a character variable with five values.

table(ames_raw$`Bldg Type`)
## 
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   2425     62    109    101    233

One can use contextual information to convert them into meaningful variables like a single family and multiple families or a shared house. Or use factor, which will, by default, make a dummy coding.

ames_raw$BldgTypeFact<-factor(ames_raw$`Bldg Type`)
head(model.matrix(~BldgTypeFact,ames_raw))
##   (Intercept) BldgTypeFact2fmCon BldgTypeFactDuplex BldgTypeFactTwnhs
## 1           1                  0                  0                 0
## 2           1                  0                  0                 0
## 3           1                  0                  0                 0
## 4           1                  0                  0                 0
## 5           1                  0                  0                 0
## 6           1                  0                  0                 0
##   BldgTypeFactTwnhsE
## 1                  0
## 2                  0
## 3                  0
## 4                  0
## 5                  0
## 6                  0

By default, R will convert the factors into a dummy, with the first level being the baseline. It’s essential to know how a dummy variable is included in a model as it is model specific.

8. Grouping

Not all categorical variable needs a unique category. One might consider grouping some categories so that you have fewer categories to model. For example, the overall condition is rated from 1 to 10, as shown below.

ggplot(ames_raw)+geom_histogram()+aes(x=`Overall Cond`)

It’s important to know which way is better. For the Ames data it is infact 10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor

One could convert them into integers since there is explicit ordering. However, the distribution of the variable is uneven, with many observations at five and very few below 5. In such a case, combining the categories into three may be better since the data does not seem to have the resolution to understand the ten levels.

ames_raw$OverallCond3 <- ifelse( ames_raw$`Overall Cond` >5, 3, ifelse( ames_raw$`Overall Cond`<5, 1, 2))
ggplot(ames_raw)+geom_histogram()+aes(x=OverallCond3)

9. Selecting and compressing

There are various reasons why you need to be selective of what to include. This could be the lack of information from the variable due to the limitations posed by the sample size, contextual reasons, or overlapping information.

  • If there is very small variability in some variable, it’s very unlikely that you will get some differetiating information out of them.

For highly correlated variables you might select variables so that correlation does not impact the model building.

# Correlation matrix
numeric_vars = ames_raw %>% select(where(is.numeric)) 
ggcorr(numeric_vars,hjust = 1)

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Alternatively, you could compress the correlated variable using dimension reduction. However, it’s no free lunch since you need to do all the scaling and missing data processing before you can apply PCA and you need to decide how many components to include. pcaMethods package offers a way to fit a model even in the presence of missing data.

BiocManager::install("pcaMethods")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'pcaMethods'
library(pcaMethods)

pca_numeric<-pcaMethods::pca(numeric_vars,nPcs=20,scale="uv")
summary(pca_numeric)
## nipals calculated PCA
## Importance of component(s):
##                  PC1     PC2     PC3    PC4     PC5     PC6     PC7     PC8
## R2            0.2393 0.07939 0.06822 0.0543 0.04926 0.04631 0.03258 0.02859
## Cumulative R2 0.2393 0.31870 0.38692 0.4412 0.49048 0.53679 0.56937 0.59796
##                   PC9    PC10    PC11    PC12    PC13    PC14    PC15    PC16
## R2            0.02797 0.02671 0.02583 0.02512 0.02463 0.02379 0.02296 0.02282
## Cumulative R2 0.62593 0.65264 0.67848 0.70359 0.72822 0.75201 0.77497 0.79780
##                  PC17    PC18    PC19    PC20
## R2            0.02162 0.02113 0.01923 0.01748
## Cumulative R2 0.81941 0.84054 0.85977 0.87725
plot(pca_numeric)

slplot(pca_numeric)

Tidymodels

In tidymodels package there is a feature engineering method called recipes. It used to be it’s own package but now they have murged the packages. You can find the example for Ames Housing here: https://www.tmwr.org/recipes

Doing things like - Take log of Gr_Liv_Area - Make Neighborhood that is less than 1% prevalent into “other” - Dummy code all the nominal predictors

Can be easily done as

library(tidymodels)
simple_ames <- 
  recipe(SalePrice ~ Neighborhood + `Gr Liv Area` + `Year Built` + `Bldg Type`,
         data = ames_raw) %>%
  step_log(`Gr Liv Area`, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors())

However, this is not executed. You will use the framework of tidymodel workflow, which will run these transformation when fitting the model.

In class work

Model fitting

Since you’ve worked on it in MA679 please copy and paste your best model here.

lmfit_2008<- lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` +  log(`Gr Liv Area`) + `Bsmt Full Bath`  + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008) 
summary(lmfit_2008)
## 
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` + 
##     `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) + 
##     `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + 
##     `Open Porch SF`, data = ames_raw_2008)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48256 -0.08492  0.00058  0.09723  0.68601 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.573e+00  5.456e-01 -10.214  < 2e-16 ***
## log(`Lot Area`)     8.383e-02  1.094e-02   7.661 3.57e-14 ***
## `Year Built`        2.898e-03  2.298e-04  12.614  < 2e-16 ***
## `Year Remod/Add`    3.078e-03  3.153e-04   9.762  < 2e-16 ***
## `Total Bsmt SF`     1.588e-04  1.967e-05   8.076 1.50e-15 ***
## `1st Flr SF`       -4.433e-05  2.275e-05  -1.949  0.05150 .  
## log(`Gr Liv Area`)  6.828e-01  2.934e-02  23.276  < 2e-16 ***
## `Bsmt Full Bath`    6.376e-02  1.007e-02   6.331 3.35e-10 ***
## `TotRms AbvGrd`    -3.040e-02  5.274e-03  -5.764 1.02e-08 ***
## `Garage Area`       2.378e-04  2.996e-05   7.938 4.41e-15 ***
## `Wood Deck SF`      1.167e-04  3.937e-05   2.964  0.00309 ** 
## `Open Porch SF`    -1.455e-04  7.253e-05  -2.005  0.04513 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1721 on 1306 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8178, Adjusted R-squared:  0.8163 
## F-statistic: 532.9 on 11 and 1306 DF,  p-value: < 2.2e-16
##

Your answer:

Please write your answer in full sentences.

Please perform feature engineering on the Ames housing data that you think will help with the prediction.

##
##

Your answer:

Please write your answer in full sentences.

Compare the result before and after the feature engineering step.

##
##

Your answer:

Please write your answer in full sentences.

AutoML

Feature engineering is mostly about context. But does it matter if the prediction is of interest? Is there automatic ways to do all of this that is better? Let’s find out.

Include all the vairables you included as well as anything you want to add to the model.

vars <- c("SalePrice","Lot Area","Gr Liv Area","Full Bath")
#vars <- c("SalePrice")#
train <- ames_raw_2008[, vars]
test  <- ames_raw_2009[, vars]
colnames(train) <- make.names(colnames(train))
colnames(test)  <- make.names(colnames(test))

# mlr3 TaskRegr
train$SalePrice <- log(train$SalePrice)
test$SalePrice <- log(test$SalePrice)

AutoML with MLR3

MLR3 has an auto ML.
https://github.com/a-hanf/mlr3automl/blob/master/vignettes/mlr3automl.md https://a-hanf.github.io/useR_presentation/useR_2021_mlr3automl.pdf

You will need to install

devtools::install_github('https://github.com/mlr-org/mlr3extralearners')
devtools::install_github('https://github.com/a-hanf/mlr3automl', dependencies = TRUE)
# load packages and data
library(mlr3)
library(mlr3learners)
library(mlr3automl)

# Create a task
task <- as_task_regr(train, target ="SalePrice",id = "ames_raw")

# Auto ML
ames_model = AutoML(task)
train_indices = sample(1:task$nrow, 2/3*task$nrow)
ames_model$train(row_ids = train_indices)

predict_indices = setdiff(1:task$nrow, train_indices)
predictions = ames_model$predict(row_ids = predict_indices)
plot(predictions$response,predictions$truth);abline(0,1)
resampling_result = ames_model$resample()
iml_explainer = ames_model$explain(iml_package = "iml")
iml_pdp = iml::FeatureEffect$new(iml_explainer, feature="Lot.Area",  method="pdp")
plot(iml_pdp)

H2O autoML

h2o autoML is well known in the field as something pretty powerful. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

# load packages and data
library(h2o)

# init h2o
h2o.init()
h2o.no_progress()

# upload the data
train_hf <- as.h2o(train)
test_hf <- as.h2o(test)

# fit a model
automl <- h2o.automl(y = "SalePrice", training_frame = train_hf, max_runtime_secs = 300)
model <- automl@leader
predictions=h2o.predict(model,newdata = test_hf)
plot( as.vector(predictions$predict),as.vector(test_hf$SalePrice));abline(0,1)
cor( as.vector(predictions$predict),as.vector(test_hf$SalePrice))

# shutdown h2o

h2o.shutdown(prompt =F) 

automl

From CRAN: Fits from simple regression to highly customizable deep neural networks either with gradient descent or metaheuristic, using automatic hyper parameters tuning and custom cost function. A mix inspired by the common tricks on Deep Learning and Particle Swarm Optimization.

https://cran.r-project.org/web/packages/automl/index.html

library(automl)
amlmodel = automl_train_manual(Xref = train,
                               Yref = train$SalePrice
                               %>% as.numeric(),
                               hpar = list(learningrate = 0.01,
                               minibatchsize = 2^2,
                               numiterations = 600))
prediction = automl_predict(model = amlmodel, X = test) 
plot(prediction,test$SalePrice);abline(0,1)

autoxgboost

XG Boost is a popular implementation of gradient boosting method that we will talk about in MA679. Leaving aside the detail, it’s another popular ML method that has a lot of tuning parameters. AutoXGBoost is a function that would search for good choice of these parameters automaticall.

# load library
devtools::install_github("ja-thomas/autoxgboost")
library(autoxgboost)
# create a classification task
trainTask = makeRegrTask(data = train, target = "SalePrice")
# create a control object for optimizer
ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 5L) 
# fit the model
res = autoxgboost(trainTask, control = ctrl, tune.threshold = FALSE)
# do prediction and print confusion matrix
prediction = predict(res, test)
plot(prediction$data[,1],prediction$data[,2]);abline(0,1)
#caret::confusionMatrix(test$Species, prediction$data$response)

forester

Forester is similar to Autoxgboost in a way it fits tree based models automatically. They can fit xgboost as well as it’s cousins like catboost.

#install.packages("devtools")
#devtools::install_github("ModelOriented/forester")
library(forester)
best_model <- forester::train(data = train, 
                       y = "SalePrice", 
                       type = "auto")

Missing Data

Working with Missing Data

NA and other types

In R, missing data is represented using NA. But other closely related values are treated the same in some applications. You can read about it in detail here.

?NA
?NULL
?NaN
?Inf

You need to be careful how the data is represented as missing in the original data and perform appropriate conversions.

Since NA is not a number, you need to use is.na() to find out if a value is NA or not.

x<-c(1,2,NA,3,4,NA)
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

If you want to know the elements that are not NA you can add !

!is.na(x)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

Some easy handling of NA

The problem with NA is that it’s a logical value but without a definition of operations. Simple functions like mean and medians will all return NA if you have any NA in the vector.

x<-c(rnorm(10),NA)
mean(x)
## [1] NA
median(x)
## [1] NA

You can remove NA and calculate these values manually. But for base R functions, there is a parameter na.rm that does the same for you.

mean(x,na.rm = T)
## [1] -0.3222407
mean(x[!is.na(x)])
## [1] -0.3222407
median(x,na.rm = T)
## [1] -0.03489692
median(x[!is.na(x)])
## [1] -0.03489692

Types of analysis

  • Available case analysis: is when you use the data based on their availability.
  • Complete case analysis: is when you remove all the rows that has any missingness.

What does R do when you have missing data?

R does available case analysis by default.

Let’s generate fake x and y and creat a missing version of x.

x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)
xmiss<-x
xmiss[sample(1:100,10)]<-NA

Compare the results.

display(lm(y~x))
## lm(formula = y ~ x)
##             coef.est coef.se
## (Intercept) 1.07     0.11   
## x           1.51     0.12   
## ---
## n = 100, k = 2
## residual sd = 1.09, R-Squared = 0.61
display(lm(y~xmiss))
## lm(formula = y ~ xmiss)
##             coef.est coef.se
## (Intercept) 1.05     0.12   
## xmiss       1.50     0.13   
## ---
## n = 90, k = 2
## residual sd = 1.11, R-Squared = 0.61

How about using more than one predictor?

x1<-rnorm(100)
x2<-rnorm(100)
y12<-1+1.4*x1-0.4*x2+rnorm(100)
x1miss<-x1
x2miss<-x2
x1miss[1:10] <-NA
x2miss[11:20]<-NA
display(lm(y12~x1+x2))
## lm(formula = y12 ~ x1 + x2)
##             coef.est coef.se
## (Intercept)  1.04     0.11  
## x1           1.24     0.11  
## x2          -0.38     0.10  
## ---
## n = 100, k = 3
## residual sd = 1.07, R-Squared = 0.62
display(lm(y12~x1miss+x2miss))
## lm(formula = y12 ~ x1miss + x2miss)
##             coef.est coef.se
## (Intercept)  1.01     0.12  
## x1miss       1.17     0.12  
## x2miss      -0.42     0.10  
## ---
## n = 80, k = 3
## residual sd = 1.04, R-Squared = 0.62
display(lm(y12~x1))
## lm(formula = y12 ~ x1)
##             coef.est coef.se
## (Intercept) 1.05     0.11   
## x1          1.31     0.12   
## ---
## n = 100, k = 2
## residual sd = 1.14, R-Squared = 0.56
display(lm(y12~x1miss))
## lm(formula = y12 ~ x1miss)
##             coef.est coef.se
## (Intercept) 1.07     0.12   
## x1miss      1.27     0.13   
## ---
## n = 90, k = 2
## residual sd = 1.17, R-Squared = 0.54

What does ggplot do with NA?

x<-c(rpois(1000,50),rep(0,100))
logx_na<-ifelse(log(x)==-Inf,NA,log(x))
gp1=ggplot(data.frame(x))+geom_histogram()+aes(x=x)
gp2=ggplot(data.frame(x))+geom_histogram()+aes(x=log(x))
gp3=ggplot(data.frame(x))+geom_histogram()+aes(x=logx_na)
gp4=ggplot(data.frame(x))+geom_histogram()+aes(x=log(x+1))
gridExtra::grid.arrange(gp1,gp2,gp3,gp4,ncol=4)
## Warning: Removed 100 rows containing non-finite values (`stat_bin()`).
## Removed 100 rows containing non-finite values (`stat_bin()`).

Missing Data Mechanisms

Three Missing Data Mechanisms

  • Missing Completely at Random (MCAR): Missing data can be regarded as a simple random sample of the complete data.
  • Missing At Random (MAR): Missingness is related to the observed data but not on the missing data
  • Missing Not At Random (MNAR): Missingness is related to the missing values them selves.

MCAR mechanism

x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)

Visualizing

When you have missing observations, it helps to understand the extent of missingness.

library(naniar)
vis_miss(riskfactors)

If the missingness is too severe, you should first question the use of that variable.

The next step in the visual assessment will be to see if there are specific patterns in the way data are missing. You will see these patterns when a group of questions are all related to the same issue.

library(UpSetR)
gg_miss_upset(riskfactors)

You can further drill down on the pattern of missings in each column, broken down by a categorical variable from the dataset using gg_miss_fct.

gg_miss_fct(x = riskfactors, fct = marital)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `marital = (function (x) ...`.
## Caused by warning:
## ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
## ℹ Please use `fct_na_value_to_level()` instead.
## ℹ The deprecated feature was likely used in the naniar package.
##   Please report the issue at <https://github.com/njtierney/naniar/issues>.

When GGPlot ignores the NAs, you can add them back using geom_miss_point. This allows you to see if there is a pattern in the missing data that varies by covariates.

x<-rnorm(1000)
y<- c(rnorm(900),rep(NA,100))
ggplot(data.frame(x))+geom_point()+aes(x=x,y=y)
## Warning: Removed 100 rows containing missing values (`geom_point()`).

ggplot(data.frame(x))+geom_point()+aes(x=x,y=y)+geom_miss_point()
## Warning: Removed 100 rows containing missing values (`geom_point()`).

Note that where the NAs are plotted is a little concerning since it’s arbitrary and often leads to confusion. But at the exploration phase, it’s better than ignoring them.

Simple Imputation strategies

x <- rnorm(100)
xmiss<-x
xmiss[sample(1:100,10)]<-NA

Mean imputation

x_imp_mean <- xmiss
x_imp_mean[is.na(x_imp_mean)]<-mean(x_imp_mean,na.rm=TRUE)
x_imp_median <- xmiss
x_imp_median[is.na(x_imp_mean)]<-median(x_imp_mean,na.rm=TRUE)

Last value carried forward

na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v)+1]
}
x_imp_lvcf<-na.locf(xmiss)

Indicator for missing ness + mean imputation

x_imp_mean <- xmiss
x_imp_mean[is.na(x_imp_mean)]<-mean(x_imp_mean,na.rm=TRUE)
x_miss_index<-1*is.na(x_imp_mean)

New category for the missing value

x_cat<- sample(c("A","B","C"),100,TRUE)
x_cat[sample(1:100,10)]<-NA
x_cat_imp<-x_cat
x_cat_imp[is.na(x_cat_imp)]<-"D"

Random Imputation

Simple random imputation

Take a random sample from the observed values and impute.

random.imp <- function (a){
    missing <- is.na(a)
    n.missing <- sum(missing)
    a.obs <- a[!missing]
    imputed <- a
    imputed[missing] <- sample (a.obs, n.missing, replace=TRUE)
    return (imputed)
}
x_imp_rand_simple<-random.imp(xmiss)

Regression based imputation

Fit regression model on the observed and impute the predicted value. - Deterministic: Use the predicted value - Random: Add random noise

x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)
ymiss<-y
ymiss[sample(1:100,10)]<-NA
lm.fit.model<-lm(ymiss~x)
y_imp_det <-ymiss
y_imp_det[is.na(y_imp_det)]<- predict (lm.fit.model,newdata=data.frame(x=x[is.na(ymiss)]))
y_imp_rand <-ymiss
y_imp_rand[is.na(y_imp_rand)]<- rnorm (sum(is.na(ymiss)), predict (lm.fit.model,newdata=data.frame(x=x[is.na(ymiss)])), sigma.hat (lm.fit.model))
par(mfrow=c(1,3))
plot(x,y,col=1+1*is.na(ymiss),pch=19,main="Original")
plot(x,y_imp_det,col=1+1*is.na(ymiss),pch=19,main="Deterministic")
plot(x,y_imp_rand,col=1+1*is.na(ymiss),pch=19,main="Random")

Iterative imputation

Regression based imputation can be repeated many times to update the imputation.

x1<-rnorm(100)
x2<-rnorm(100)
y12<-1+1.4*x1-0.4*x2+rnorm(100)
y12miss     <- y12
x1miss      <- x1
x2miss      <- x2
x1miss[1:10] <-NA
x2miss[11:20]<-NA
y12miss[5:15]<-NA

rand.reg.imp <- function (dat.y, dat.x){
    missing <- is.na(dat.y)
    n.missing <- sum(missing)
    dat.y.obs <- dat.y[!missing]
    imputed <- dat.y
    lm.fit.model <- lm(dat.y~.,data.frame(dat.y,dat.x))
    imputed[missing] <- rnorm (n.missing, 
                               predict (lm.fit.model,newdata=data.frame(dat.x[missing,])), 
                               sigma.hat (lm.fit.model))
    return (imputed)
}

misdat<-data.frame(y=y12miss,x1=x1miss,x2=x2miss)
fildat<-apply(misdat,2,random.imp)
for(j in 1:100){
  for(i in 1:3){
    fildat[,i]<-rand.reg.imp(misdat[,i],fildat[,-i])
  }
}

Multiple imputation

Multiple imputation is a way of imputing multiple times to account for the uncertainty in the imputation.

mi

mi has been my personal favorite since I was the original developer. It uses multiple imputations using chained equations. This was the first package that thought carefully about diagnostics and convergence of the multiple imputations.

mdf <- missing_data.frame(misdat)
imp_mi <- mi(mdf, seed = 335)
summary(imp_mi)
## $y
## $y$is_missing
## missing
## FALSE  TRUE 
##    89    11 
## 
## $y$imputed
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.175343 -0.370213 -0.002965  0.006035  0.346768  1.127787 
## 
## $y$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.15354 -0.29616  0.04957  0.00000  0.32849  1.23374 
## 
## 
## $x1
## $x1$is_missing
## missing
## FALSE  TRUE 
##    90    10 
## 
## $x1$imputed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7697 -0.1334  0.1395  0.1742  0.4191  1.2886 
## 
## $x1$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.23361 -0.40282  0.07882  0.00000  0.36343  1.01490 
## 
## 
## $x2
## $x2$is_missing
## missing
## FALSE  TRUE 
##    90    10 
## 
## $x2$imputed
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.975021 -0.526084 -0.278845 -0.263816 -0.006778  0.446671 
## 
## $x2$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.94084 -0.32633 -0.03362  0.00000  0.41717  1.34881
mi_compdat<-mi:::complete(imp_mi,2)
plot(imp_mi)

mice

Mice is considered the most used software in this space. It predates mi, but it copied many functions from mi and is very similar to mi.

library(mice)
imp_mice <- mice(misdat, m=5, maxit = 50, method = 'pmm', seed = 500)
## 
##  iter imp variable
##   1   1  y  x1  x2
##   1   2  y  x1  x2
##   1   3  y  x1  x2
##   1   4  y  x1  x2
##   1   5  y  x1  x2
##   2   1  y  x1  x2
##   2   2  y  x1  x2
##   2   3  y  x1  x2
##   2   4  y  x1  x2
##   2   5  y  x1  x2
##   3   1  y  x1  x2
##   3   2  y  x1  x2
##   3   3  y  x1  x2
##   3   4  y  x1  x2
##   3   5  y  x1  x2
##   4   1  y  x1  x2
##   4   2  y  x1  x2
##   4   3  y  x1  x2
##   4   4  y  x1  x2
##   4   5  y  x1  x2
##   5   1  y  x1  x2
##   5   2  y  x1  x2
##   5   3  y  x1  x2
##   5   4  y  x1  x2
##   5   5  y  x1  x2
##   6   1  y  x1  x2
##   6   2  y  x1  x2
##   6   3  y  x1  x2
##   6   4  y  x1  x2
##   6   5  y  x1  x2
##   7   1  y  x1  x2
##   7   2  y  x1  x2
##   7   3  y  x1  x2
##   7   4  y  x1  x2
##   7   5  y  x1  x2
##   8   1  y  x1  x2
##   8   2  y  x1  x2
##   8   3  y  x1  x2
##   8   4  y  x1  x2
##   8   5  y  x1  x2
##   9   1  y  x1  x2
##   9   2  y  x1  x2
##   9   3  y  x1  x2
##   9   4  y  x1  x2
##   9   5  y  x1  x2
##   10   1  y  x1  x2
##   10   2  y  x1  x2
##   10   3  y  x1  x2
##   10   4  y  x1  x2
##   10   5  y  x1  x2
##   11   1  y  x1  x2
##   11   2  y  x1  x2
##   11   3  y  x1  x2
##   11   4  y  x1  x2
##   11   5  y  x1  x2
##   12   1  y  x1  x2
##   12   2  y  x1  x2
##   12   3  y  x1  x2
##   12   4  y  x1  x2
##   12   5  y  x1  x2
##   13   1  y  x1  x2
##   13   2  y  x1  x2
##   13   3  y  x1  x2
##   13   4  y  x1  x2
##   13   5  y  x1  x2
##   14   1  y  x1  x2
##   14   2  y  x1  x2
##   14   3  y  x1  x2
##   14   4  y  x1  x2
##   14   5  y  x1  x2
##   15   1  y  x1  x2
##   15   2  y  x1  x2
##   15   3  y  x1  x2
##   15   4  y  x1  x2
##   15   5  y  x1  x2
##   16   1  y  x1  x2
##   16   2  y  x1  x2
##   16   3  y  x1  x2
##   16   4  y  x1  x2
##   16   5  y  x1  x2
##   17   1  y  x1  x2
##   17   2  y  x1  x2
##   17   3  y  x1  x2
##   17   4  y  x1  x2
##   17   5  y  x1  x2
##   18   1  y  x1  x2
##   18   2  y  x1  x2
##   18   3  y  x1  x2
##   18   4  y  x1  x2
##   18   5  y  x1  x2
##   19   1  y  x1  x2
##   19   2  y  x1  x2
##   19   3  y  x1  x2
##   19   4  y  x1  x2
##   19   5  y  x1  x2
##   20   1  y  x1  x2
##   20   2  y  x1  x2
##   20   3  y  x1  x2
##   20   4  y  x1  x2
##   20   5  y  x1  x2
##   21   1  y  x1  x2
##   21   2  y  x1  x2
##   21   3  y  x1  x2
##   21   4  y  x1  x2
##   21   5  y  x1  x2
##   22   1  y  x1  x2
##   22   2  y  x1  x2
##   22   3  y  x1  x2
##   22   4  y  x1  x2
##   22   5  y  x1  x2
##   23   1  y  x1  x2
##   23   2  y  x1  x2
##   23   3  y  x1  x2
##   23   4  y  x1  x2
##   23   5  y  x1  x2
##   24   1  y  x1  x2
##   24   2  y  x1  x2
##   24   3  y  x1  x2
##   24   4  y  x1  x2
##   24   5  y  x1  x2
##   25   1  y  x1  x2
##   25   2  y  x1  x2
##   25   3  y  x1  x2
##   25   4  y  x1  x2
##   25   5  y  x1  x2
##   26   1  y  x1  x2
##   26   2  y  x1  x2
##   26   3  y  x1  x2
##   26   4  y  x1  x2
##   26   5  y  x1  x2
##   27   1  y  x1  x2
##   27   2  y  x1  x2
##   27   3  y  x1  x2
##   27   4  y  x1  x2
##   27   5  y  x1  x2
##   28   1  y  x1  x2
##   28   2  y  x1  x2
##   28   3  y  x1  x2
##   28   4  y  x1  x2
##   28   5  y  x1  x2
##   29   1  y  x1  x2
##   29   2  y  x1  x2
##   29   3  y  x1  x2
##   29   4  y  x1  x2
##   29   5  y  x1  x2
##   30   1  y  x1  x2
##   30   2  y  x1  x2
##   30   3  y  x1  x2
##   30   4  y  x1  x2
##   30   5  y  x1  x2
##   31   1  y  x1  x2
##   31   2  y  x1  x2
##   31   3  y  x1  x2
##   31   4  y  x1  x2
##   31   5  y  x1  x2
##   32   1  y  x1  x2
##   32   2  y  x1  x2
##   32   3  y  x1  x2
##   32   4  y  x1  x2
##   32   5  y  x1  x2
##   33   1  y  x1  x2
##   33   2  y  x1  x2
##   33   3  y  x1  x2
##   33   4  y  x1  x2
##   33   5  y  x1  x2
##   34   1  y  x1  x2
##   34   2  y  x1  x2
##   34   3  y  x1  x2
##   34   4  y  x1  x2
##   34   5  y  x1  x2
##   35   1  y  x1  x2
##   35   2  y  x1  x2
##   35   3  y  x1  x2
##   35   4  y  x1  x2
##   35   5  y  x1  x2
##   36   1  y  x1  x2
##   36   2  y  x1  x2
##   36   3  y  x1  x2
##   36   4  y  x1  x2
##   36   5  y  x1  x2
##   37   1  y  x1  x2
##   37   2  y  x1  x2
##   37   3  y  x1  x2
##   37   4  y  x1  x2
##   37   5  y  x1  x2
##   38   1  y  x1  x2
##   38   2  y  x1  x2
##   38   3  y  x1  x2
##   38   4  y  x1  x2
##   38   5  y  x1  x2
##   39   1  y  x1  x2
##   39   2  y  x1  x2
##   39   3  y  x1  x2
##   39   4  y  x1  x2
##   39   5  y  x1  x2
##   40   1  y  x1  x2
##   40   2  y  x1  x2
##   40   3  y  x1  x2
##   40   4  y  x1  x2
##   40   5  y  x1  x2
##   41   1  y  x1  x2
##   41   2  y  x1  x2
##   41   3  y  x1  x2
##   41   4  y  x1  x2
##   41   5  y  x1  x2
##   42   1  y  x1  x2
##   42   2  y  x1  x2
##   42   3  y  x1  x2
##   42   4  y  x1  x2
##   42   5  y  x1  x2
##   43   1  y  x1  x2
##   43   2  y  x1  x2
##   43   3  y  x1  x2
##   43   4  y  x1  x2
##   43   5  y  x1  x2
##   44   1  y  x1  x2
##   44   2  y  x1  x2
##   44   3  y  x1  x2
##   44   4  y  x1  x2
##   44   5  y  x1  x2
##   45   1  y  x1  x2
##   45   2  y  x1  x2
##   45   3  y  x1  x2
##   45   4  y  x1  x2
##   45   5  y  x1  x2
##   46   1  y  x1  x2
##   46   2  y  x1  x2
##   46   3  y  x1  x2
##   46   4  y  x1  x2
##   46   5  y  x1  x2
##   47   1  y  x1  x2
##   47   2  y  x1  x2
##   47   3  y  x1  x2
##   47   4  y  x1  x2
##   47   5  y  x1  x2
##   48   1  y  x1  x2
##   48   2  y  x1  x2
##   48   3  y  x1  x2
##   48   4  y  x1  x2
##   48   5  y  x1  x2
##   49   1  y  x1  x2
##   49   2  y  x1  x2
##   49   3  y  x1  x2
##   49   4  y  x1  x2
##   49   5  y  x1  x2
##   50   1  y  x1  x2
##   50   2  y  x1  x2
##   50   3  y  x1  x2
##   50   4  y  x1  x2
##   50   5  y  x1  x2
imp_mice$imp # Imputed values
## $y
##             1           2           3          4           5
## 5   0.1915959 -0.40519236  1.39881020  1.8676865  0.30939834
## 6   2.7447702 -0.43398485  1.98581978  1.7757237  1.64084122
## 7   0.3436375  1.64084122  0.56529989 -1.8747174  1.55497973
## 8   2.8354757  4.85985712  1.86768654  5.0739152  1.36313727
## 9   0.7918676  1.39981641 -0.95888615  0.1915959  1.67065916
## 10 -1.1578936  0.15156745 -0.02748227 -1.8747174  2.14273524
## 11 -1.0167486  1.64084122  1.55497973  1.5549797 -1.87471735
## 12  1.3696749  1.42082456 -0.09988091  1.9847025  1.98470245
## 13  2.3478799 -0.09988091 -0.43398485  1.0567470 -0.02748227
## 14 -3.1780567  0.31393114 -0.49478648 -0.3951807 -1.15789359
## 15  2.4006078  2.40060777  2.40060777  2.4825946  1.82065185
## 
## $x1
##              1            2          3           4          5
## 1   0.29747424  0.391915163  0.6524310  0.25986764  0.1673495
## 2   0.36433700  0.364336999  1.2176289  0.81349533  0.2737471
## 3   1.52410399  0.836374444  0.4616355  0.83637444  1.5241040
## 4   1.22916446 -0.428969938  0.4616355  0.33699024  0.4616355
## 5  -0.66064519 -2.205324320  0.2737471  0.46163550 -1.0980212
## 6   1.31793154 -0.424439009  1.2176289 -0.47865037  1.1968380
## 7   0.08178517  0.461635504 -0.4649633 -1.05335840  0.6176239
## 8   1.03709719  2.002095744  1.2285925  2.32524759  0.9737953
## 9   0.65243100  0.728737433 -2.0820366  0.08178517  0.8314716
## 10 -1.78130180  0.002099191 -1.1435303 -1.05335840  0.4227256
## 
## $x2
##             1           2           3          4          5
## 11 -1.4373575 -0.89689823 -1.74376829 -0.4912884  0.2941387
## 12 -0.5089588  1.30038045  0.02755044 -0.7667865 -0.5089588
## 13 -0.3159401 -0.34441789 -0.01226730  0.2522217 -0.7690259
## 14  0.7377375 -0.49128840 -0.47478500  1.3003804 -0.5551897
## 15 -1.8966227 -0.47478500 -0.05078550 -0.7143761  3.1301698
## 16 -0.5152477 -0.04783456 -1.91444934  0.3942609  0.4804277
## 17  1.1940832 -0.49128840 -0.54307869 -1.8966227 -0.5430787
## 18 -1.9144493  0.52847962  0.50635150  1.3483236  1.3483236
## 19 -0.2463615 -0.83081880 -1.89449232  2.3939096  0.1604178
## 20  0.2844143 -0.93861180 -0.71437609 -1.9802127  0.1115891
comp_mice <- complete(imp_mice,2) # Filled in matrix

Machine Learning Imputation

Machine learning based imputation is pretty popular nowadays. missForest is one example of using a random forest algorithm for imputation.

missForest

library(missForest)
imp_forest<- missForest(misdat)

Timeseries

We will not go too much into time series imputation, but it also comes up fairly frequently.

imputeTS

More information can be found here: https://cran.rstudio.com/web/packages/imputeTS/vignettes/Cheat_Sheet_imputeTS.pdf

library("imputeTS")
ggplot_na_distribution(tsAirgap)

imp <- na_kalman(tsAirgap)
ggplot_na_imputations(tsAirgap, imp)